12/01/2019

14.3 Cluster Analysis

Unsupervised learning problem

  • Have some set of objects or points
  • Want to group or cluster these objects

Goal:
Want objects clustered together to be more similar to each other than to objects clustered apart

The catch:
We do not know the true structure!
This is the key difference from supervised learning

Basic notation

Some set of points, \(x_i\), where \(i = 1, \dots, N\)
Each of these points has some attributes, indexed by \(j = 1, \dots, p\)

Example dataset: Iris

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

What does it mean to be similar?

Define a dissimilarity between each attribute: \[d_j(x_{ij}, x_{i'j})\] Define a dissimilarity between points: \[ D(x_i, x_{i'})= \sum_{j=1}^p d_j (x_{ij}, x_{i'j}) \] A common choice is squared Euclidean distance: \[D(x_i, x_{i'}) = \sum_{j=1}^p d_j(x_{ij},x_{i'j}) = \sum_{j=1}^p (x_{ij} - x_{i'j})^2\]
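As a quick check of the definition, the squared Euclidean dissimilarity between the first two iris rows (values taken from the `str` output below) can be computed attribute by attribute:

```r
# Two points with p = 4 attributes (first two iris rows)
x1 <- c(5.1, 3.5, 1.4, 0.2)
x2 <- c(4.9, 3.0, 1.4, 0.2)

# Squared Euclidean dissimilarity: sum of per-attribute squared differences
D <- sum((x1 - x2)^2)
D  # 0.04 + 0.25 + 0 + 0 = 0.29
```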

Attribute weights

\[ D(x_i, x_{i'}) = \sum_{j=1}^p w_j \cdot d_j(x_{ij}, x_{i'j}); \quad \sum_{j=1}^p w_j = 1 \]

  • setting \(w_j\) to \(\tfrac{1}{p}\) does not necessarily mean all attributes provide equal contribution

  • an alternative is to weight each attribute by the inverse of its average dissimilarity, so let \(w_j = \frac{1}{\bar{d}_j}\) where

\[ \bar{d}_j = \frac{1}{N^2}\sum^N_{i=1}\sum^N_{i'=1}d_j(x_{ij}, x_{i'j}) \]

  • choosing weights/normalising should be decisions made with respect to the subject matter
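A sketch of the \(\bar{d}_j\) weighting for the iris measurements, using squared differences as \(d_j\). For squared differences \(\bar{d}_j\) works out to twice the (biased) variance of attribute \(j\), so these weights standardise the attributes:

```r
# Average per-attribute dissimilarity \bar{d}_j over all N^2 pairs,
# using squared differences as d_j (iris measurements as an example)
X <- as.matrix(iris[, 1:4])
N <- nrow(X)

d_bar <- apply(X, 2, function(col) {
  sum(outer(col, col, function(a, b) (a - b)^2)) / N^2
})

# For squared differences, \bar{d}_j = 2 * (biased) variance of attribute j,
# so w_j = 1 / \bar{d}_j equalises the average contribution of each attribute
w <- 1 / d_bar
```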

Types of dissimilarities

Different types of dissimilarities:

  • Quantitative (e.g. Euclidean)
  • Ordinal (e.g. rank based: A, B, C, D, E)
  • Categorical (e.g. sentiment: happy, sad, angry)

Dissimilarities are not necessarily distances

  • don't need to satisfy triangle inequality
  • the proximity matrix or distance matrix may not be symmetric
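A one-line illustration that squared Euclidean dissimilarity need not satisfy the triangle inequality:

```r
# Squared Euclidean dissimilarity violating the triangle inequality
x <- 0; y <- 1; z <- 2
d <- function(a, b) (a - b)^2
d(x, z)            # 4
d(x, y) + d(y, z)  # 1 + 1 = 2, so d(x,z) > d(x,y) + d(y,z)
```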

Clustering algorithms

  • combinatorial - work on the data with no reference to an underlying probability model

  • mixture models - assume the data are sampled iid from some population; each cluster corresponds to one component of the mixture density

  • mode seekers (bump hunters) - non-parametric approach, attempting to estimate modes of the distribution function

K-means

Some set of points, \(x_i\), where \(i = 1, \dots, N\)

Prespecify a number of clusters \(K < N\)

Index each cluster by \(k \in \{1, \dots, K\}\)

Each observation can belong to only one cluster

Define an encoder function, \(k = C(i)\), so \(x_i\) is assigned label \(k\)

Assume there is some 'optimal' encoder \(C^*(i)\) that captures the true underlying structure

How do we define optimal?

"Total" point scatter = "Within" point scatter + "Between" point scatter

\[ \begin{aligned} T(C) &= \frac{1}{2} \sum_{i=1}^N\sum_{i'= 1}^N D(x_i, x_{i'}) \\ &= \frac{1}{2}\sum_{k=1}^K\sum_{C(i) = k}\left( \sum_{C(i') =k} D(x_i, x_{i'}) + \sum_{C(i') \neq k} D(x_i, x_{i'}) \right) \\ &= W(C) + B(C) \end{aligned}\]

  • Minimise \(W(C)\)

  • Maximise \(B(C)\)

  • Can't evaluate \(W(C)\) for all possible encoders, so we use greedy iterative algorithms
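The decomposition \(T(C) = W(C) + B(C)\) can be verified numerically for an arbitrary encoder:

```r
# Numerical check of T(C) = W(C) + B(C) with squared Euclidean distance,
# for an arbitrary assignment C of the iris measurements to K = 3 clusters
X <- as.matrix(iris[, 1:4])
D <- as.matrix(dist(X))^2          # pairwise squared Euclidean distances
set.seed(1)
C <- sample(1:3, nrow(X), replace = TRUE)

T_C <- sum(D) / 2                                                  # total scatter
W_C <- sum(sapply(1:3, function(k) sum(D[C == k, C == k]))) / 2    # within
B_C <- sum(sapply(1:3, function(k) sum(D[C == k, C != k]))) / 2    # between

all.equal(T_C, W_C + B_C)  # TRUE
```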

K-means assumptions

  • Squared Euclidean distance

  • Minimise \(W(C)\)

\[ \begin{aligned} W(C) &= \frac{1}{2}\sum_{k=1}^K\sum_{C(i) = k}\sum_{C(i')=k} \|x_i - x_{i'}\|^2 \\ &=\sum_{k=1}^K N_k \sum_{C(i) = k}\|x_i - \bar{x}_k \|^2, \end{aligned}\] where \(N_k\) is the number of points in the cluster with index \(k\) and \(\bar{x}_k\) is the mean vector.
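The identity between the pairwise form and the centroid form of \(W(C)\) can also be checked numerically:

```r
# Check the pairwise form of W(C) against the centroid form
X <- as.matrix(iris[, 1:4])
set.seed(1)
C <- sample(1:3, nrow(X), replace = TRUE)

W_pairwise <- sum(sapply(1:3, function(k) {
  sum(as.matrix(dist(X[C == k, ]))^2) / 2       # (1/2) sum over ordered pairs
}))

W_centroid <- sum(sapply(1:3, function(k) {
  Xk <- X[C == k, , drop = FALSE]
  nrow(Xk) * sum(sweep(Xk, 2, colMeans(Xk))^2)  # N_k * within-cluster SS
}))

all.equal(W_pairwise, W_centroid)  # TRUE
```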

Algorithm

  1. Generate a random assignment of points to clusters
  2. Find the mean of each cluster
  3. Update the assignment of points to clusters with respect to the new means
  4. Repeat steps 2–3 until the assignments no longer change.
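The steps above can be sketched directly in R (an illustration only, with no empty-cluster handling; use `kmeans()` in practice):

```r
# Minimal K-means sketch following the four steps above
simple_kmeans <- function(X, K, max_iter = 100) {
  C <- sample(1:K, nrow(X), replace = TRUE)     # 1. random assignment
  for (iter in 1:max_iter) {
    # 2. mean of each cluster (K x p matrix of centres)
    centers <- t(sapply(1:K, function(k) colMeans(X[C == k, , drop = FALSE])))
    # 3. reassign each point to its nearest mean (squared Euclidean)
    C_new <- apply(X, 1, function(x) which.min(colSums((t(centers) - x)^2)))
    if (all(C_new == C)) break                  # 4. stop when assignments are stable
    C <- C_new
  }
  list(cluster = C, centers = centers)
}
```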

Note:

The result will depend on the initial assignment of points.

The algorithm will converge to a local minimum, but not necessarily the global minimum.

Iris example

http://rpubs.com/Nitika/kmeans_Iris

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
iris.new<- iris[,c(1,2,3,4)]
iris.class<- iris[,"Species"]

Normalising

summary(iris.new)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
normalize <- function(x){return ((x-min(x))/(max(x)-min(x)))}
iris.new$Sepal.Length <- normalize(iris.new$Sepal.Length)
iris.new$Sepal.Width <- normalize(iris.new$Sepal.Width)
iris.new$Petal.Length <- normalize(iris.new$Petal.Length)
iris.new$Petal.Width <- normalize(iris.new$Petal.Width)

K-means in R

result <- kmeans(x = iris.new, centers = 3, nstart = 10) 
result$size
## [1] 50 61 39
result$centers 
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1    0.1961111   0.5950000   0.07830508  0.06083333
## 2    0.4412568   0.3073770   0.57571548  0.54918033
## 3    0.7072650   0.4508547   0.79704476  0.82478632
result$cluster
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [71] 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 3 3
## [106] 3 2 3 3 3 3 3 3 2 3 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 2 3 3 3 2 3
## [141] 3 3 2 3 3 3 2 3 3 2

K-means in R

Total number of correctly classified instances are:
36 + 47 + 50 = 133

Total number of incorrectly classified instances are:
3 + 14 = 17

Accuracy = 133/(133+17) = 0.887, i.e. the model has achieved about 88% accuracy

##    iris.class
##     setosa versicolor virginica
##   1     50          0         0
##   2      0         47        14
##   3      0          3        36
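The confusion matrix above can be reproduced with `table()`; this sketch recreates the objects from the earlier chunks (cluster labels may permute between runs):

```r
# Recreate iris.new and the k-means fit, then cross-tabulate clusters vs species
iris.new <- as.data.frame(lapply(iris[, 1:4],
                                 function(x) (x - min(x)) / (max(x) - min(x))))
set.seed(1)
result <- kmeans(x = iris.new, centers = 3, nstart = 10)
conf <- table(result$cluster, iris[, "Species"])
sum(apply(conf, 1, max)) / sum(conf)   # accuracy via majority species per cluster
```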

How do we choose \(K\)?

Lots of methods exist; check out the NbClust package
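Before reaching for NbClust, a simple base-R heuristic is the "elbow" plot of total within-cluster sum of squares against \(K\):

```r
# 'Elbow' heuristic: total within-cluster sum of squares for K = 1, ..., 8
iris.new <- as.data.frame(lapply(iris[, 1:4],
                                 function(x) (x - min(x)) / (max(x) - min(x))))
set.seed(1)
wss <- sapply(1:8, function(k) kmeans(iris.new, centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "K", ylab = "Total within-cluster SS")
```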

Iris example contd.

Air Quality Example

http://rpubs.com/Nitika/kmeans_Airquality (What not to do!!!)

data("airquality")
str(airquality)
## 'data.frame':    153 obs. of  6 variables:
##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

Air quality contd.

#Normalize the dataset 
normalize<- function(x){
  return((x-min(x))/(max(x)-min(x)))
}

# replace contents of dataset with normalized values
airquality <- normalize(airquality)

# apply k-means algorithm using the first 4 attributes and with k=3
result<- kmeans(airquality[c(1,2,3,4)], 3) 
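A hint at why this is "what not to do": `airquality` contains missing values, so `min()` and `max()` over the whole data frame are NA, the "normalised" data are entirely NA, and the `kmeans()` call fails:

```r
# airquality has missing values: Ozone and Solar.R contain NAs
colSums(is.na(datasets::airquality))
# min()/max() over the whole data frame are therefore NA ...
min(datasets::airquality)
# ... so normalize(airquality) returns a data frame of NAs,
# and kmeans() errors on the all-NA input
```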

Air quality contd.

Limitations

  • K-means not robust to outliers

  • Method is commonly phrased using squared Euclidean distance

  • Optimisation requires a notion of a mean/centroid

  • Standard implementation is in terms of points not distances

K-medoids

  • Optimise with respect to one of the points, instead of the mean

  • More robust to outliers

  • Implementation is in terms of either distances or points

PAM Algorithm

  1. Randomly select an initial set of \(K\) points. These are the initial medoids, \(m_k\).

  2. Assign each point, \(x_i\), to its closest medoid, \(m_k\)

  3. For each cluster, update the medoid. The new medoid is the point within that cluster that minimises the sum of within-cluster distances \[ m_k = \mathop{\mathrm{argmin}}\limits_{x_i \in C_k} \sum_{C(i') =k} D(x_i, x_{i'}). \]

  4. Repeat steps 2–3 until the medoids are unchanged.
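The medoid update (step 3) can be sketched as a function of a distance matrix and the current assignment (an illustration only; use `cluster::pam()` in practice):

```r
# Medoid update: for each cluster, pick the member minimising its summed
# distance to the other members
update_medoids <- function(D, C, K) {
  sapply(1:K, function(k) {
    members <- which(C == k)
    members[which.min(rowSums(D[members, members, drop = FALSE]))]
  })
}

X <- as.matrix(iris[, 1:4])
D <- as.matrix(dist(X))
set.seed(1)
C <- sample(1:3, nrow(X), replace = TRUE)
m <- update_medoids(D, C, 3)   # indices of the three medoid points
```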

Iris Example

library(cluster)
result_kmedoids_1 <- pam(x = iris.new, k = 3, diss = FALSE) 
result_kmedoids_2 <- pam(x = dist(iris.new), k = 3, diss = TRUE)
result_kmedoids_3 <- pam(x = iris.new, k = 3, 
                       diss = FALSE, metric = "euclidean")

Iris Example

Again, how do we choose \(K\)?

Let \(m_k\) be the medoid of the cluster with label \(k\).
For a point \(x_i\), \(m_k\) is its closest medoid.
Define \(m_{k'}\) to be the next closest medoid.

Silhouette coefficient: \[ s_i(k) = 1 - \dfrac{D(m_k,x_i)}{D(m_{k'}, x_i)} \] Average silhouette coefficient: \[ \bar{s}(K) = \dfrac{1}{N} \sum_{k = 1}^K \sum_{C(i) = k} s_i(k) \]
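A sketch computing this medoid-based silhouette from `pam()` output (note this is the slide's simplified definition, not the classical silhouette):

```r
library(cluster)  # ships with R

iris.new <- as.data.frame(lapply(iris[, 1:4],
                                 function(x) (x - min(x)) / (max(x) - min(x))))
fit <- pam(iris.new, k = 3)

# Euclidean distance from every point to each of the K medoids (N x K)
d_to_med <- apply(fit$medoids, 1,
                  function(m) sqrt(colSums((t(as.matrix(iris.new)) - m)^2)))

ord <- t(apply(d_to_med, 1, sort))   # per point: closest, next closest, ...
s <- 1 - ord[, 1] / ord[, 2]         # s_i(k) as defined above
mean(s)                              # average silhouette coefficient
```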

Average silhouette coefficient

Iris Example contd.

Example

Consider \(\max\{D(x_i,x_j), 2\}\) as the clustering distance.

Density example

Limitations

  • still have to pick \(K\)

  • sensitive to point density

  • again, optimisation makes implicit assumptions about distance

  • greedy

  • slow for large numbers of points

Hierarchical clustering

  • Don't have to pick \(K\)

  • There is an implicit ordering or hierarchy within the clustering

Dendrogram

hc <- hclust(dist(iris.new), method = "average")
plot(hc, hang = -1, labels = FALSE)
clust1 <- cutree(hc, k = 3)

Linkages

Single: \[D(C_k, C_{k'}) = \min \{D(x_k, x_{k'}) : x_k \in C_k, x_{k'} \in C_{k'} \},\] also known as nearest neighbour; it is the minimum pairwise distance.

Complete (Furthest Neighbour): \[ D(C_k, C_{k'}) = \max \{D(x_k, x_{k'}) : x_k \in C_k, x_{k'} \in C_{k'} \}, \] the maximum pairwise distance.

Group Average: \[ D(C_k, C_{k'}) = \frac{1}{|C_k|\,|C_{k'}|} \sum_{x_k \in C_k} \sum_{x_{k'} \in C_{k'}} D(x_k, x_{k'}), \] also known as the unweighted pair group method with arithmetic mean (UPGMA).
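All three linkages are available in base R's `hclust()` via the `method` argument:

```r
# Single, complete and group-average linkage via hclust()'s method argument
d <- dist(iris[, 1:4])
hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")   # UPGMA
table(cutree(hc_average, k = 3))               # cluster sizes at K = 3
```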

Algorithm

To generate the agglomerative dendrogram:

  1. Merge the branches of the clusters with the smallest dissimilarity
  2. Update the dissimilarities relative to the new cluster.
  3. Repeat steps 1–2 until all points are combined in a single cluster.

More dendrograms

Clusters

Summary

  • Covered K-means, K-medoids and Hierarchical Clustering

  • Covered worked example in base R

  • Looked at strengths and limitations

  • Touched on cluster selection